1. Problem statement

  1. Predict which customers are more likely to purchase the newly introduced travel package.
  2. Make recommendations to the Policy Maker and Marketing Team on how to target the potential customers who are likely to purchase the newly introduced travel package.

Import libraries

2. Load data

3. Exploratory Data Analysis

3.1 Shape and data types

Insights:

  1. There are missing values in the columns TypeofContact, DurationOfPitch, NumberOfFollowups, PreferredPropertyStar, NumberOfTrips, NumberOfChildrenVisiting, and MonthlyIncome.
  2. We can convert the object type columns to categories.
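A minimal sketch of how the shape, data types, and missing-value counts can be inspected with pandas. The small `df` below is made-up stand-in data, not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (all values are made up).
df = pd.DataFrame({
    "Age": [41, 49, np.nan, 33],
    "TypeofContact": ["Self Enquiry", None, "Company Invited", "Self Enquiry"],
    "MonthlyIncome": [20993.0, np.nan, 17090.0, np.nan],
})

print(df.shape)   # (rows, columns)
print(df.dtypes)  # data type of each column

# Count missing values per column and keep only the columns that have any.
missing = df.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```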

Fixing the data types

We can see that the memory usage has decreased from 763.9 KB to 564.4 KB. This technique is generally most useful for bigger datasets.
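The conversion can be sketched as follows. The toy frame and the column list `cat_cols` are assumptions for illustration, so the memory figures it prints will not match the ones quoted above.

```python
import pandas as pd

# Toy frame standing in for the real dataset (all values are made up).
df = pd.DataFrame({
    "Occupation": ["Salaried", "Small Business", "Salaried", "Large Business"] * 500,
    "Gender": ["Male", "Female"] * 1000,
    "Age": list(range(18, 58)) * 50,
})

before = df.memory_usage(deep=True).sum()

# Convert the string columns to the category dtype, one column at a time.
cat_cols = ["Occupation", "Gender"]  # hypothetical selection for this sketch
for col in cat_cols:
    df[col] = df[col].astype("category")

after = df.memory_usage(deep=True).sum()
print(f"{before / 1024:.1f} KB -> {after / 1024:.1f} KB")
```

A category column stores each distinct string once plus a small integer code per row, which is why the saving grows with the number of rows.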

3.2 Basic summary statistics and consequences

Central Tendency of data

Skewness
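A short sketch of how the summary statistics and skewness can be computed with pandas. The values below are made up for illustration.

```python
import pandas as pd

# Toy numeric columns (values are made up).
df = pd.DataFrame({
    "Age": [18, 25, 30, 35, 36, 40, 61],
    "NumberOfTrips": [1, 2, 2, 3, 3, 4, 22],
})

# Count, mean, std, min, quartiles and max, one row per column.
summary = df.describe().T
print(summary)

# Sample skewness: values above 0 indicate a longer right tail.
skewness = df.skew()
print(skewness)
```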

Insights:

  1. CustomerID: this column can be dropped, as it is just an identifier.
  2. Age: minimum is 18 and maximum is 61, with a mean of 37.6; the distribution is positively skewed.
  3. ProdTaken: takes only two values, 0 and 1. It is a categorical variable.
  4. CityTier: minimum is 1 and maximum is 3. As per the data dictionary, it is a categorical variable.
  5. DurationOfPitch: minimum is 5 and maximum is 127, with a mean of 15.4. This variable is positively skewed and has outliers.
  6. NumberOfPersonVisiting: minimum is 1 and maximum is 5. This column has positive skewness.
  7. NumberOfFollowups: minimum is 1 and maximum is 6. This column seems negatively skewed.
  8. PreferredPropertyStar: minimum is 3 and maximum is 5. This column has positive skewness. It is a categorical variable.
  9. NumberOfTrips: minimum is 1 and maximum is 22. The 25th and 50th percentiles are 2 and 3 respectively. This column definitely has positive skewness and outliers.
  10. PitchSatisfactionScore: minimum is 1 and maximum is 5.
  11. MonthlyIncome: minimum is 1000 and maximum is 98678. It is positively skewed and has outliers.
  12. Passport, PitchSatisfactionScore, OwnCar, ProdTaken, CityTier: all of these variables are categorical.
  13. ProdTaken: this is the target variable.

Changing the data type of Passport, PitchSatisfactionScore, OwnCar, ProdTaken, and CityTier to category

Insights

  1. TypeofContact: Almost 70% of customers were contacted through Self Enquiry. This column has some missing values.
  2. Occupation: The majority of customers are Salaried.
  3. Gender: There are more male customers than female customers.
  4. ProductPitched: The majority of customers were pitched the Basic package.
  5. MaritalStatus: The majority of customers are Married.
  6. Designation: The most frequent designation is Executive.

Checking categories for categorical variables

3.3 Preprocessing of Data

3.4 Univariate and Bivariate Analysis

3.4.a Univariate plots of continuous variables

Plot for Age

Insights:

Age: There are no outliers in the Age variable, but it is positively skewed.

Plot for Monthly Income

Insights:

MonthlyIncome: The income variable has outliers and is positively skewed. We need to handle both the outliers and the skewness.

Plot for DurationOfPitch

Insights:

DurationOfPitch: It has outliers and is highly skewed. The median for this variable is 13.

Plot for NumberOfPersonVisiting

Insights:

NumberOfPersonVisiting: This variable has an outlier at 5 and is positively skewed. The most frequent value of NumberOfPersonVisiting is 3.

Plot for NumberOfFollowups

Insights:

NumberOfFollowups: It has outliers and positive skewness. The most frequent value of NumberOfFollowups is 4. It has outliers at 1 and 6.

Plot for NumberOfTrips

Insights:

NumberOfTrips: It has outliers and positive skewness. The most frequent value of NumberOfTrips is 2. There are 4 outliers, at 19, 20, 21, and 22 trips.

3.4.b Univariate plots for categorical variables

Plot for ProdTaken

Plot for TypeofContact

Plot for CityTier

65.3% of customers are from CityTier 1, followed by 30.7% from CityTier 3.

Plot for Occupation

Plot for Gender

Plot for ProductPitched

Plot for MaritalStatus

Plot for Passport

Plot for PitchSatisfactionScore

Plot for OwnCar

Plot for Designation

3.4.c Bivariate plots

Pair plot

Plotting the relationship of the continuous variables with each other and with the ProdTaken variable

3.4.d Heatmap
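A heatmap of this kind is usually drawn from a correlation matrix. The sketch below only computes that matrix with pandas; the data is made up, and the result could then be passed to a plotting function such as seaborn's `heatmap`.

```python
import pandas as pd

# Toy frame standing in for the real dataset (all values are made up).
df = pd.DataFrame({
    "Age": [25, 32, 41, 38, 50, 29],
    "MonthlyIncome": [17000.0, 20000.0, 26000.0, 23000.0, 31000.0, 18500.0],
    "NumberOfTrips": [2, 3, 1, 5, 2, 4],
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = df.select_dtypes(include="number").corr()
print(corr.round(2))
```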

3.4.e Plotting relationship of customer's information with ProdTaken
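One way to quantify such a relationship before plotting it is a row-normalized cross-tabulation, sketched here on made-up data. Passport is used only as an example column; the proportions below say nothing about the real dataset.

```python
import pandas as pd

# Toy data (values are made up): 1 = yes, 0 = no.
df = pd.DataFrame({
    "Passport": [0, 0, 0, 0, 1, 1, 1, 1],
    "ProdTaken": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Share of ProdTaken = 0 / 1 within each Passport group (each row sums to 1).
rates = pd.crosstab(df["Passport"], df["ProdTaken"], normalize="index")
print(rates)
```

The same table can be drawn as a stacked bar chart with `rates.plot(kind="bar", stacked=True)`.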

3.4.f Plot between MonthlyIncome, Age and ProdTaken

3.4.g Plotting relationship of customer's travel history with ProdTaken

3.4.h Plot between customer interaction and ProdTaken

3.4.i Category plot - Plot of NumberOfPersonVisiting, MonthlyIncome and Occupation with ProdTaken

3.4.j Category plot - Plot between ProdTaken and MaritalStatus, ProductPitched, PreferredPropertyStar

3.4.k Category plot - Plot between ProdTaken and Occupation, ProductPitched, PreferredPropertyStar

3.4.l Category plot - Plot of MaritalStatus, MonthlyIncome and ProductPitched with ProdTaken

3.4.m Category plot - Plot of MaritalStatus, Age and ProductPitched with ProdTaken

4. Insights based on EDA

Marital status specific:

Package specific:

3. Data Pre-processing

3.1 Missing value treatment

Treating missing values of continuous variables (DurationOfPitch, MonthlyIncome, Age)
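A minimal sketch of imputing these columns. The choice of the median is an assumption made here because the columns are skewed; the data is made up.

```python
import numpy as np
import pandas as pd

# Toy data with gaps (all values are made up).
df = pd.DataFrame({
    "DurationOfPitch": [6.0, 14.0, np.nan, 9.0, 30.0],
    "MonthlyIncome": [20000.0, np.nan, 17000.0, 25000.0, 98000.0],
    "Age": [41.0, 49.0, 37.0, np.nan, 32.0],
})

# Fill each gap with the column median, which is robust to outliers.
for col in ["DurationOfPitch", "MonthlyIncome", "Age"]:
    df[col] = df[col].fillna(df[col].median())

print(df)
```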

Treating missing values of discrete and categorical variables (NumberOfTrips, NumberOfChildrenVisiting, NumberOfFollowups, PreferredPropertyStar, TypeofContact)
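For count-like and categorical columns, one common option is to fill gaps with the most frequent value. The sketch below assumes that approach and uses made-up data with two of the columns.

```python
import numpy as np
import pandas as pd

# Toy data with gaps (all values are made up).
df = pd.DataFrame({
    "NumberOfTrips": [2.0, 3.0, 2.0, np.nan, 5.0],
    "TypeofContact": ["Self Enquiry", "Company Invited", None,
                      "Self Enquiry", "Self Enquiry"],
})

# Fill each gap with the column mode (most frequent value).
for col in ["NumberOfTrips", "TypeofContact"]:
    most_frequent = df[col].mode().iloc[0]
    df[col] = df[col].fillna(most_frequent)

print(df)
```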

Outlier Treatment and Feature Engineering (scaling)

3.2 Scaling of Age and MonthlyIncome

Robust transformation of Age

Robust transformation of MonthlyIncome
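A sketch of the scaling step, assuming "robust transformation" refers to scikit-learn's `RobustScaler`, which subtracts the median and divides by the interquartile range. The data is made up.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Toy data (all values are made up).
df = pd.DataFrame({
    "Age": [22.0, 30.0, 35.0, 41.0, 58.0],
    "MonthlyIncome": [17000.0, 20000.0, 22000.0, 26000.0, 98000.0],
})

cols = ["Age", "MonthlyIncome"]
scaler = RobustScaler()  # default: center on the median, scale by the IQR

# fit_transform returns a NumPy array, so wrap it back into a DataFrame.
scaled = pd.DataFrame(scaler.fit_transform(df[cols]), columns=cols, index=df.index)
print(scaled)
```

Because the median and IQR are barely affected by extreme values, a large income such as the last row does not compress the rest of the column the way min-max scaling would.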

3.3 Outlier treatment for MonthlyIncome, DurationOfPitch, NumberOfPersonVisiting, NumberOfFollowups, NumberOfTrips
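One common treatment is to cap values at the boxplot whiskers (1.5 times the IQR beyond the quartiles). The sketch below assumes that rule; `cap_outliers` is a hypothetical helper name and the data is made up.

```python
import pandas as pd

def cap_outliers(series, k=1.5):
    """Clip values lying more than k * IQR beyond the first or third quartile."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy column with one extreme value (values are made up).
trips = pd.Series([1, 2, 2, 3, 3, 4, 22], name="NumberOfTrips")
capped = cap_outliers(trips)
print(capped.tolist())
```

Capping keeps every row in the dataset, unlike dropping the outlying rows.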

3.4 Prepare data for model building

Create dummy variables
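A minimal sketch of one-hot encoding with `pandas.get_dummies`. The columns and values are made up, and `drop_first=True` is an assumption made here to avoid one redundant column per variable.

```python
import pandas as pd

# Toy data (all values are made up).
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Occupation": ["Salaried", "Small Business", "Salaried"],
    "Age": [41, 33, 29],
})

# Replace each categorical column with indicator columns.
encoded = pd.get_dummies(df, columns=["Gender", "Occupation"], drop_first=True)
print(encoded.columns.tolist())
```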

Splitting the data into training and test sets
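A sketch of the split with scikit-learn. The 70/30 ratio, the `random_state`, and the use of `stratify` are assumptions for illustration; the data is made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with 20% positives (all values are made up).
df = pd.DataFrame({
    "Age": [20 + i % 40 for i in range(100)],
    "ProdTaken": [1] * 20 + [0] * 80,
})

X = df.drop(columns="ProdTaken")
y = df["ProdTaken"]

# stratify=y keeps the share of buyers the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)
print(X_train.shape, X_test.shape)
```

Stratifying matters when the target is imbalanced, since a purely random split could leave the test set with too few positive cases.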

4. Model Building and evaluating performance

Functions to show different metrics and the confusion matrix
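A minimal sketch of such helpers using scikit-learn's metric functions. `get_metrics` is a hypothetical name, and the labels below are made up.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

def get_metrics(y_true, y_pred):
    """Return accuracy, recall, precision and F1 for binary labels."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }

# Made-up labels: 1 = purchased the package, 0 = did not.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

metrics = get_metrics(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows = actual, columns = predicted
print(metrics)
print(cm)
```

Calling the helper once on the training predictions and once on the test predictions makes overfitting easy to spot.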

Build Decision Tree Model

Confusion matrix

Bagging Classifier

Bagging Classifier with weighted decision tree
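A sketch of bagging over a class-weighted decision tree. The synthetic data, `class_weight="balanced"`, and `n_estimators=50` are assumptions for illustration, not the settings used in the analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the prepared dataset.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# The base tree gives more weight to the minority (buyer) class.
weighted_tree = DecisionTreeClassifier(class_weight="balanced", random_state=1)

# The base estimator is passed positionally; each of the 50 trees
# is trained on a bootstrap sample of the training data.
bagging = BaggingClassifier(weighted_tree, n_estimators=50, random_state=1)
bagging.fit(X_train, y_train)

test_recall = recall_score(y_test, bagging.predict(X_test))
print(f"Test recall: {test_recall:.3f}")
```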

Random Forest Classifier

Random forest with class weights
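A sketch of a class-weighted random forest on synthetic data. The weights actually used are not stated, so `class_weight="balanced"` is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the prepared dataset.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# "balanced" weights each class inversely to its frequency in y_train.
rf = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=1
)
rf.fit(X_train, y_train)

train_recall = recall_score(y_train, rf.predict(X_train))
test_recall = recall_score(y_test, rf.predict(X_test))
print(f"Recall: train {train_recall:.3f}, test {test_recall:.3f}")
```

An explicit dictionary such as `class_weight={0: 0.2, 1: 0.8}` is also accepted if specific weights are preferred.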

Performance improvement for bagging models: hyperparameter tuning with GridSearch

Tuning Decision Tree
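A sketch of tuning a decision tree with `GridSearchCV`. The parameter grid is a hypothetical example, and recall is assumed as the scoring metric; the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the prepared dataset.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Hypothetical grid: limit depth and leaf size to curb overfitting.
param_grid = {
    "max_depth": [2, 4, 6],
    "min_samples_leaf": [1, 5, 10],
}

grid = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=1),
    param_grid,
    scoring="recall",  # assumed metric: catch as many buyers as possible
    cv=5,
)
grid.fit(X_train, y_train)

best_tree = grid.best_estimator_  # refit on the full training set
print(grid.best_params_, round(grid.best_score_, 3))
```

The same pattern applies to the bagging and random forest models by swapping the estimator and the grid.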

Tuning Bagging Classifier

Tuning Random Forest

Boosting Models

Gradient Boosting Classifier

XGBoost Classifier

Performance improvement for boosting models: hyperparameter tuning with GridSearch

AdaBoost Classifier

Gradient Boosting Classifier

XGBoost Classifier

Feature importance of XGB model
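A sketch of ranking feature importances. To keep it free of extra dependencies, scikit-learn's `GradientBoostingClassifier` is used as a stand-in; xgboost's `XGBClassifier` exposes the same `feature_importances_` attribute. The data and feature names are synthetic.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data; the feature names are placeholders.
X, y = make_classification(n_samples=300, n_features=6, random_state=1)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Stand-in for the XGBoost model.
model = GradientBoostingClassifier(n_estimators=50, random_state=1)
model.fit(X, y)

# Pair each importance with its feature name and sort, largest first.
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```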

Stacking model

Now, let's build a stacking model with the tuned models - decision tree, random forest and gradient boosting, then use XGBoost to get the final prediction.
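The structure can be sketched with scikit-learn's `StackingClassifier`. To keep the example free of extra dependencies, a `GradientBoostingClassifier` stands in for XGBoost as the final estimator, and untuned models on synthetic data stand in for the tuned ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the prepared dataset.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Base learners (placeholders for the tuned models).
estimators = [
    ("decision_tree", DecisionTreeClassifier(max_depth=4, random_state=1)),
    ("random_forest", RandomForestClassifier(n_estimators=50, random_state=1)),
    ("gradient_boosting", GradientBoostingClassifier(random_state=1)),
]

# Stand-in for the XGBoost final estimator.
final_estimator = GradientBoostingClassifier(n_estimators=50, random_state=1)

# The final estimator is trained on cross-validated base-model predictions.
stack = StackingClassifier(
    estimators=estimators, final_estimator=final_estimator, cv=5
)
stack.fit(X_train, y_train)
predictions = stack.predict(X_test)
print(stack.score(X_test, y_test))
```

With xgboost installed, `final_estimator` could be replaced by an `XGBClassifier` instance without changing the rest of the code.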

Comparing all the models - Model performance evaluation

Best Model:

After comparing these models, the tuned XGBoost model turned out to be the best. First, it has good recall on both the training and test data. Second, the difference between its training and test metrics is the smallest among all the models. The final model's performance could be improved further by trying different hyperparameters.

Feature importance of the best model (tuned XGBoost)

Actionable insights and recommendations for the business